Predicting Stock Market Volatility Using Graph Attention Networks

Capturing Inter-Stock Dynamics for Enhanced Accuracy

Authors

OPTIVER STREAM, GROUP 21:

Shreya Prakash (520496062)

Chenuka Garunsinghe (530080640)

Binh Minh Tran (530414672)

Enoch Wong (530531430)

Ruohai Tao (540222281)

Zoha Kausar (530526838)

1. Executive Summary

Realized volatility (RV), measuring stock price fluctuations, is vital for financial risk management but difficult to predict due to inter-stock dependencies. This study demonstrates that a Graph Attention Network (GAT) can model these relationships dynamically, achieving an RMSE of 1.236 x 10-3, surpassing traditional models like HAR-RV. The GAT enhances accuracy and interpretability by revealing sector-based influences. Deployed in the VoltaTrade Shiny app, this solution enables investors and stock traders to access real-time, interactive RV predictions, helping them assess risks, optimize portfolios, and refine trading strategies. Combining finance, artificial intelligence (deep learning), and data science, this interdisciplinary work delivers a scalable forecasting tool.

Innovations: This research introduces three key advancements:

  1. Utilization of a Graph Attention Network to model stock relationships dynamically, unlike static traditional methods.

  2. Incorporation of inter-stock dynamics to enhance forecasting accuracy through attention-based weighting.

  3. Deployment in a Shiny application, enabling user-friendly, real-time volatility analysis for broader accessibility.

2. Introduction

The rise of stock trading in the 2010s, amplified in the 2020s by zero-commission platforms and social media movements, has heightened the need for reliable forecasting tools (Governors of the Federal Reserve System, 2022). Realized volatility (RV), which quantifies stock price fluctuations over a specific period, is a cornerstone for risk management, derivative pricing, and portfolio optimization in financial markets (Andersen et al., 2003); (Hull, 2015).

However, most traditional models used to forecast volatility - such as GARCH or HAR-RV - assume linearity and treat stocks as independent entities, ignoring the non-linear dynamics and cross-sectional dependencies that govern real-world markets (Corsi, 2009); (Cont, 2001). This creates a performance ceiling in modern settings, where market behaviour is complex, high-frequency, and increasingly interlinked. This formed our research question:

Research Question
Can graph neural networks identify inter-stock relationships using market metrics to improve RV forecasting accuracy over traditional models?

To address this, we developed a graph-based deep learning framework that models the stock market as an evolving graph, where nodes represent stocks, and edges reflect historical similarities. Using a Graph Attention Network (GAT), we dynamically weigh inter-stock influences, enhancing prediction accuracy and interpretability (Velickovic et al., 2018); (Zhang et al., 2022)

We also deployed this GAT-based pipeline into a real-time Shiny web-application which empowers individual investors and quantitative-traders by allowing them to interact with volatility predictions, visually explore inter-stock relationships, and filter based on configurable metrics - making a traditionally opaque process more transparent, interpretable, and user-centric.

Inspired by the Kaggle Optiver Realized Volatility Prediction Challenge (Optiver, 2021), we focus on building a practical, and scalable end-to-end volatility prediction system that integrates cross-disciplinary theory of -

  • Market behaviour and volatility mechanisms from Finance.

  • Modelling of complex non-linear dependencies from Deep Learning (AI).

  • Full modelling life-cycle, from data cleaning to real-time deployment from Data Science.

3. Methodology

Overview of Data Analysis Process

3.1 Data Pre-processing

3.1.1 Optiver LOB dataset

Show Code
# list of all imports
from glob import glob
import pandas as pd, os
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from itertools import product
import warnings
from pandas.errors import PerformanceWarning

# Suppress fragmentation warnings from pandas
warnings.filterwarnings("ignore", category=PerformanceWarning)

# list all book files and the target table
real_volatility = pd.read_csv('real_volatility.csv')

book_paths = sorted(glob("individual_book_train/stock_*.csv"))

df_files = (pd.DataFrame({"path": book_paths})
              .assign(stock_id=lambda d: d["path"].astype(str)
              .str.extract(r'(\d+)').astype(int)))
              
device = torch.device('mps' if torch.mps.is_available() else 'cpu') # can change mps to cuda for non metal devices 

We use the Optiver limit-order-book (LOB) training set: 120 stocks × 3830 ten-minute buckets, each bucket containing 600 one-second snapshots of the top-of-book. The public train.csv file provides realized-volatility targets; all other fields come from the per-stock book_*.csv tables.

3.2.2 De-normalization & cleaning

For every stock ID and time ID we forward-fill the 600 snapshots, compute the smallest non-zero jump \(\delta P\), and rescale the whole bucket by \(\frac{0.01}{\delta P}\), thereby recovering genuine dollar prices.

A quick histogram of rescaled jumps shows only integer ticks, confirming the fix (Appendix)

Show Code
def denorm_scale(grp, col="ask_price1"):
    grp = (grp.set_index("seconds_in_bucket")
              .reindex(range(600), method="ffill")
              .reset_index())
    tick = grp[col].diff().abs().loc[lambda x: x>0].min()
    return (grp[col] * (0.01 / tick)).mean()       # bucket opening price

prices = []
for _, row in df_files.iterrows():
    df = pd.read_csv(row.path, usecols=["time_id", "seconds_in_bucket", "ask_price1"])
    s  = df.groupby("time_id").apply(denorm_scale, include_groups=False).rename(row.stock_id)
    prices.append(s)

df_prices = pd.concat(prices, axis=1)   # rows: time_id, cols: stock_id

Missing data is rare; when a stock misses > 0.05 % of seconds on any trading day (≈ 44/88 200) we drop it. For the remaining series we Winsorise at the 0.1 % / 99.9 % quantiles to suppress single-tick glitches.

Because the time-id values are shuffled, we restore chronology with a 1-D spectral embedding: treat each bucket as a point in ℝ¹²⁰, embed with the leading Laplacian eigen-vector, and sort. Applied to S&P-100 closes this technique reproduces calendar order perfectly, so we adopt it here.

Show Code
from sklearn.manifold import SpectralEmbedding
from sklearn.preprocessing import minmax_scale
import yfinance as yf

def spectral_order(df, k=30, seed=42):
    """
    Return index sorted by the leading spectral coordinate.
    df : (n_buckets × n_stocks) price matrix with *no* NaNs.
    """
    df_clean = df.fillna(df.mean())
    X = minmax_scale(df_clean.values) # normalise
  
    emb_2d = SpectralEmbedding(random_state=seed).fit_transform(X)
    coord = emb_2d[:, 0]
    return df.index[coord.argsort()]

THRESHOLD = 0.0005
keep = df_prices.isna().mean().le(0.0005)
df_prices = df_prices.loc[:, keep]

# winsorise
q_lo, q_hi = df_prices.quantile(0.001), df_prices.quantile(0.999)
df_prices_denorm_clean = df_prices.clip(lower=q_lo, upper=q_hi, axis=1).ffill().bfill()
time_id_ordered = spectral_order(df_prices_denorm_clean)
df_prices_ordered = df_prices_denorm_clean.reindex(time_id_ordered)
Show Code
import contextlib
import sys
import io
import warnings

from sklearn.manifold import SpectralEmbedding
from sklearn.preprocessing import minmax_scale
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

f = io.StringIO()
with warnings.catch_warnings(), contextlib.redirect_stdout(f), contextlib.redirect_stderr(f):
    warnings.simplefilter("ignore")
    
    # Download S&P-100 data
    sp100 = pd.read_html("https://en.wikipedia.org/wiki/S%26P_100")[2].Symbol
    sp100 = sp100.str.replace('.', '-', regex=False)
    df_real = (yf.download(sp100.to_list(), start="2020-01-01", end="2021-06-01", interval="1d")['Close']
                 .dropna(axis=1, thresh=0.5*len(sp100))
                 .dropna())
                           

# Embed both matrices in 2-D for eyeballing
embed = SpectralEmbedding(n_components=2, random_state=42)
Z_denorm = embed.fit_transform(minmax_scale(df_prices_ordered.values))
Z_real = embed.fit_transform(minmax_scale(df_real.values))


import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(16, 6))

# Optiver Buckets
sc0 = ax[0].scatter(Z_denorm[:, 0], Z_denorm[:, 1],
                    c=np.arange(len(Z_denorm)), cmap='viridis', s=8)
cb0 = fig.colorbar(sc0, ax=ax[0], shrink=0.8, pad=0.03)
cb0.set_label("Spectral Order Index", fontsize=11, labelpad=10)

_ = ax[0].set_title("Optiver Buckets\n(Colour = Spectral Order)", fontsize=13)
_ = ax[0].set_xlabel("Spectral Dimension 1", fontsize=11)
_ = ax[0].set_ylabel("Spectral Dimension 2", fontsize=11)

# S&P-100 Daily
sc1 = ax[1].scatter(Z_real[:, 0], Z_real[:, 1],
                    c=mdates.date2num(df_real.index), cmap='viridis', s=8)
cb1 = fig.colorbar(sc1, ax=ax[1], shrink=0.8, pad=0.03)
cb1.set_label("TimeID order (Calender Data Progression)", fontsize=11, labelpad=10)

_ = ax[1].set_title("S&P-100 Daily\n(Colour = Calendar Date)", fontsize=13)
_ = ax[1].set_xlabel("Spectral Dimension 1", fontsize=11)
_ = ax[1].set_ylabel("Spectral Dimension 2", fontsize=11)

# Figure title and spacing
_ = fig.suptitle("Spectral Embedding of Optiver vs S&P-100 Stock Closes", fontsize=16, y=1.02)
fig.subplots_adjust(wspace=0.25, top=0.85)
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

Finally, we turn prices into stationary log-returns:

log-diff RV = log Pₜ − log Pₜ₋₁. After this transform the Zivot–Andrews test fails to reject stationarity for every stock, giving us a stable target for modelling. The resulting panel (T − 1 = 3 829 × N = 120) underpins all subsequent feature engineering.

Show Code

#| echo: true
#| warning: false
#| message: false
#| output: true

def compute_rv(grp):
    """Compute realized volatility from intraday price data"""
    if all(col in grp.columns for col in ['bid_price1', 'ask_price1', 'bid_size1', 'ask_size1']):
        wap = (grp['bid_price1'] * grp['ask_size1'] + 
               grp['ask_price1'] * grp['bid_size1']) / \
              (grp['bid_size1'] + grp['ask_size1'])
    else:
        wap = grp['ask_price1']
        
    log_returns = np.log(wap).diff().dropna()
    rv = np.sqrt((log_returns ** 2).sum())
    return rv

# RV for each stock and time_id combination
rv_records = []
for _, row in df_files.iterrows():
    try:
        dfb = pd.read_csv(row.path)
        rv_series = (dfb.groupby('time_id')
                       .apply(compute_rv, include_groups=False)
                       .rename('rv')
                       .reset_index())
        rv_series['stock_id'] = row.stock_id
        rv_records.append(rv_series)
    except Exception as e:
        print(f"Error processing {row.path}: {e}")
        continue

rv_df = pd.concat(rv_records, ignore_index=True)

# map to bucket_idx using time ordering
time_map = pd.DataFrame({'time_id': time_id_ordered})
time_map['bucket_idx'] = range(len(time_map))
rv_df = rv_df.merge(time_map, on='time_id')
rv_pivot = rv_df.pivot(index='bucket_idx', columns='stock_id', values='rv')
avg_rv = rv_pivot.mean(axis=1)
Show Code
# Stationarity tests
from statsmodels.tsa.stattools import zivot_andrews

def test_stationarity(ts, name="Series"):
    """Test stationarity using Zivot-Andrews test (better for structural breaks)"""
    ts_clean = ts.dropna()
    
    try:
        za_stat, za_pval, za_cv, za_lag, za_bpidx = zivot_andrews(
            ts_clean.values,
            regression='ct',  # break in both intercept and trend
            trim=0.15,
            maxlag=12,
            autolag='AIC'
        )
        
        is_stationary = za_stat < za_cv['5%']
        status = '✓ Stationary' if is_stationary else '✗ Non-stationary'
        
        print(f"{name:15} | ZA: {za_stat:6.3f} (p={za_pval:.3f}) | {status}")
        print(f"{'':15} | Break at index: {za_bpidx} | 5% crit: {za_cv['5%']:6.3f}")
        
        return is_stationary
        
    except Exception as e:
        print(f"{name:15} | ZA test failed: {str(e)}")
        return False

test_stationarity(avg_rv, "Average RV by Time")
Average RV by Time | ZA: -10.702 (p=0.001) | ✓ Stationary
                | Break at index: 3024 | 5% crit: -5.073
np.True_
Show Code

from statsmodels.tsa.stattools import zivot_andrews

log_rv = np.log(avg_rv + 1e-8)
stationary_rv = log_rv.diff().dropna()

Now that this transformation has worked as seen on the transformed volatility by time id plot we apply this transformation for each individual stock across its time_id

Show Code
import matplotlib.pyplot as plt
import numpy as np

def plot_volatility_transformation(avg_rv, transformed_data, transform_name="Log + Diff"):  
    fig, axes = plt.subplots(2, 1, figsize=(14, 10))
    
    axes[0].plot(avg_rv.index, avg_rv.values, color='blue', linewidth=0.8)
    axes[0].set_title('Original Average Realised Volatility Over Time', fontsize=14)
    axes[0].set_ylabel('Average RV', fontsize=12)
    axes[0].grid(True, alpha=0.3)

    #! start from index 1 after log diff transform 
    transform_index = avg_rv.index[1:len(transformed_data)+1]
    
    axes[1].plot(transform_index, transformed_data.values, color='green', linewidth=0.8)
    axes[1].set_title(f'Transformed Volatility ({transform_name}) - Stationary', fontsize=14)
    axes[1].set_ylabel(f'Transformed RV ({transform_name})', fontsize=12)
    axes[1].set_xlabel('Time id', fontsize=12)
    axes[1].grid(True, alpha=0.3)
    axes[1].axhline(y=0, color='black', linestyle='--', alpha=0.5)
    
    plt.tight_layout()
    plt.show()

plot_volatility_transformation(avg_rv, stationary_rv, "Log + First Diff")

Show Code
import numpy as np
import pandas as pd

def create_stationary_features_fixed(rv_pivot):
    if isinstance(rv_pivot, pd.DataFrame):
        rv_matrix = rv_pivot.values
    else:
        rv_matrix = rv_pivot
    
    T, N = rv_matrix.shape
    print(f"Input matrix: {T} time points × {N} stocks")
    log_rv = np.log(rv_matrix + 1e-8)
    stationary_rv = np.diff(log_rv, axis=0)  # Shape: (T-1, N)
  
    stationary_rv = np.nan_to_num(stationary_rv, nan=0.0, posinf=0.0, neginf=0.0)    
    return stationary_rv

transformed_rv_pivot = create_stationary_features_fixed(rv_pivot)
Input matrix: 3830 time points × 112 stocks

3.2 Feature engineering

We need to enrich the transformed_rv_pivot () where T is the number of time buckets and N is the number of stocks with more detail to increase the information gain of the data for the GAT model to learn both temporal pattern and short term fluctuations. In order to achieve this we proceeded with the following features (See appendix for more detail on features):

  1. Own lags (3) – Rvᵢ,ₜ₋₁ … Rvᵢ,ₜ₋₃ capture short-range persistence.

  2. Momentum (1) – mean(Rvᵢ,ₜ₋₃:ₜ₋₁) − mean(Rvᵢ,ₜ₋₆:ₜ₋₄) flags acceleration.

  3. Mean-reversion (1) – −(Rvᵢ,ₜ − μ₁₀) where μ₁₀ is the 10-bucket rolling mean.

  4. Vol-of-vol (1) – σ₅, the std-dev of the last five buckets.

Each feature is z-scored per stock and stacked into Xₜ ∈ ℝᴺ×⁶.

All calculation of these is done per stock.

Show Code
def build_initial_features(stationary_rv):
    T, N = stationary_rv.shape
    features_list = []
    
    # 1. Own-RV lags
    lags = [1, 2, 3]
    for lag in lags:
        lag_feat = np.zeros((T, N))
        if lag < T:
            lag_feat[lag:, :] = stationary_rv[:-lag, :]
        features_list.append(lag_feat[:, :, np.newaxis])
    
    # 2. Volatility momentum
    vol_momentum = np.zeros((T, N, 1))
    for t in range(6, T):
        recent = np.mean(stationary_rv[t-3:t], axis=0)
        past = np.mean(stationary_rv[t-6:t-3], axis=0)
        vol_momentum[t, :, 0] = recent - past
    features_list.append(vol_momentum)
    
    # 3. Mean reversion tendency
    mean_reversion = np.zeros((T, N, 1))
    for t in range(10, T):
        window = stationary_rv[t-10:t]
        mean_val = np.mean(window, axis=0)
        current_deviation = stationary_rv[t] - mean_val
        mean_reversion[t, :, 0] = -current_deviation  # the tendency to revert
    features_list.append(mean_reversion)
    
    # 4. Volatility of volatility
    vol_of_vol = np.zeros((T, N, 1))
    for t in range(5, T):
        window_std = np.std(stationary_rv[t-5:t], axis=0)
        vol_of_vol[t, :, 0] = window_std
    features_list.append(vol_of_vol)
    
    X_initial = np.concatenate(features_list, axis=2)  # Shape: (T, N, 6)
    X_initial = np.nan_to_num(X_initial, nan=0.0, posinf=0.0, neginf=0.0)
    
    print(f"Initial features: {X_initial.shape} (T × N × {X_initial.shape[2]} features)")
    return X_initial

X_initial_features = build_initial_features(transformed_rv_pivot)
Initial features: (3829, 112, 6) (T × N × 6 features)

3.3 Graph construction & GAT architecture

In this section we describe how we turned the above normalized price matrix into a static stock stock-neighbour graph, assemble node-level features and the apply a two layer Graph Attention Network for bucket‐by‐bucket volatility forecasting.

3.3.1 Building the neighbour graph and GAT

What & why. For every stock we embed the most-recent 50-bucket price signature, this was done because stock relationships change over time! Stocks that moved together 2 years ago might not move together now (A logical assumption we made). and use a KD-tree to find its K=3 nearest neighbors. This captures current co-movement, recognizing that relationships drift over time.

Returns. A PyG-ready edge_index (source–destination pairs) and exponentially decaying edge_weight, plus the raw neighbor matrix and the list of stocks that survived NaN screening and with this, we captured 88% of the Optiver universe within the data

Show Code
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KDTree
import torch
import numpy as np

def build_graph_on_features(X_features, time_window=50, K=3):
    T, N, F = X_features.shape
    
    # Use recent time window for similarity (stocks change over time)
    if time_window < T:
        recent_features = X_features[-time_window:, :, :]  # Last 50 time steps
    else:
        recent_features = X_features
        
    X_for_graph = recent_features.transpose(1, 0, 2).reshape(N, -1)
    
    # Remove stocks with missing features
    valid_stocks = ~np.isnan(X_for_graph).any(axis=1)
    X_clean = X_for_graph[valid_stocks]
    valid_indices = np.where(valid_stocks)[0]
    print(f"   Valid stocks: {len(valid_indices)} / {len(valid_stocks)}")
    
    # Min-Max scale feature space and build tree
    X_scaled = MinMaxScaler().fit_transform(X_clean)
    tree = KDTree(X_scaled, metric='euclidean')
    dist, nbr_raw = tree.query(X_scaled, k=K+1)  # includes self
    
    # mapping back to original indices
    nbr = valid_indices[nbr_raw]
    src = np.repeat(valid_indices, K)
    dst = nbr[:, 1:].ravel()
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    edge_weight = torch.exp(-torch.tensor(dist[:, 1:].ravel(), dtype=torch.float))
    
    return edge_index, edge_weight, nbr, valid_indices

edge_index, edge_weight, neighbor_indices, valid_indices = build_graph_on_features(
    X_initial_features, time_window=50, K=3
)
   Valid stocks: 112 / 112
Show Code
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import torch

def plot_kdtree_network(edge_index, edge_weight, valid_indices):
    if isinstance(edge_index, torch.Tensor):
        edge_index = edge_index.cpu().numpy()
    if isinstance(edge_weight, torch.Tensor):
        edge_weight = edge_weight.cpu().numpy()
    
    G = nx.Graph()
    for i, stock_idx in enumerate(valid_indices):
        G.add_node(i, stock_id=stock_idx)
    
    sources = edge_index[0]
    targets = edge_index[1]
    
    for i in range(len(sources)):
        src = sources[i]
        tgt = targets[i]
        weight = edge_weight[i]
        G.add_edge(src, tgt, weight=weight)
    
    plt.figure(figsize=(10, 8))
    pos = nx.spring_layout(G, k=3, iterations=50, seed=42)
    nx.draw_networkx_nodes(G, pos, 
                          node_color='lightblue',
                          node_size=1000,
                          linewidths=2,
                          edgecolors='darkblue')
    
    edges = G.edges()
    weights = [G[u][v]['weight'] for u, v in edges]
    
    # Normalize weights for edge thickness
    max_weight = max(weights) if weights else 1
    edge_widths = [3 * (w / max_weight) for w in weights]
    
    nx.draw_networkx_edges(G, pos,
                          width=edge_widths,
                          alpha=0.7,
                          edge_color='gray')
    
    labels = {i: f'{valid_indices[i]}' for i in range(len(valid_indices))}
    nx.draw_networkx_labels(G, pos, labels, font_size=10, font_weight='bold')
    
    plt.title('Stock Relationship Network (KD-Tree)', fontsize=14, fontweight='bold')
    plt.axis('off')
    plt.figtext(0.02, 0.02, 
                f'Nodes: {len(valid_indices)}\nEdges: {len(edges)}\nAvg Weight: {np.mean(weights):.3f}',
                fontsize=10, bbox=dict(boxstyle="round,pad=0.3", facecolor="lightgray"))
    
    plt.tight_layout()
    plt.show()
    return G

What & why. Using the neighbor matrix below, we compute for every bucket t the average log-diff RV of each stock’s three neighbours and stack it with the six temporal channels built in § 4.1. This single cross-sectional feature gives the GAT some market “consensus” signal without inflating the feature dimension. We add the six temporal channels and the one neighbour channel to X_transformed converting it to a tensor for later training.

Show Code
def build_neighbour_feats(stationary_rv, neighbor_indices):
    """
    Build features properly with correct dimensions
    """
    T, N = stationary_rv.shape
  
    # Neighbor mean RV (1 feature)
    nei_mean = np.zeros((T, N, 1))
    for i in range(N):
        if i < len(neighbor_indices) and len(neighbor_indices[i]) > 1:
            nei_idx = neighbor_indices[i, 1:]  # exclude self
            nei_idx = nei_idx[nei_idx < N]  # ensure valid indices
            if len(nei_idx) > 0:
                nei_mean[:, i, 0] = stationary_rv[:, nei_idx].mean(axis=1)
        
    # Clean up
    X = np.nan_to_num(nei_mean, nan=0.0, posinf=0.0, neginf=0.0)
    
    print(f"Features created: {X.shape} (T × N × {X.shape[2]} features)")
    
    return X, stationary_rv

X_transformed, y_transformed = build_neighbour_feats(transformed_rv_pivot, neighbor_indices)
Features created: (3829, 112, 1) (T × N × 1 features)
Show Code

X_tensor_transformed = torch.tensor(X_transformed, dtype=torch.float)
y_tensor_transformed = torch.tensor(y_transformed, dtype=torch.float)

Then to learn the dynamic weightings of the neighbours we defined a simple 2-layer GAT that lets each stock fuse its own history with weighted neighbour signals.

Show Code
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class ImprovedVolatilityGAT(torch.nn.Module):
    def __init__(self, in_feats, hidden=64, heads=4, dropout=0.3):
        super().__init__()
        self.dropout = dropout
        
        # Smaller, more stable architecture
        self.conv1 = GATConv(in_feats, hidden, heads=heads, dropout=dropout, concat=True)
        self.conv2 = GATConv(hidden * heads, hidden // 2, heads=2, dropout=dropout, concat=True)
        self.conv3 = GATConv(hidden, 1, heads=1, concat=False, dropout=dropout)
        
        # Batch normalization for stability
        self.bn1 = torch.nn.BatchNorm1d(hidden * heads)
        self.bn2 = torch.nn.BatchNorm1d(hidden)
        
        self._initialize_weights()
    
    def _initialize_weights(self):
        """Better weight initialization"""
        for m in self.modules():
            if isinstance(m, torch.nn.Linear):
                torch.nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0)
    
    def forward(self, x, edge_index):
        # Layer 1
        h = self.conv1(x, edge_index)
        h = self.bn1(h) if h.size(0) > 1 else h  # Skip BN for single samples
        h = F.elu(h)
        h = F.dropout(h, p=self.dropout, training=self.training)
        
        # Layer 2
        h = self.conv2(h, edge_index)
        h = self.bn2(h) if h.size(0) > 1 else h
        h = F.elu(h)
        h = F.dropout(h, p=self.dropout, training=self.training)
        
        # Output layer
        h = self.conv3(h, edge_index)
        return h.squeeze(-1)

3.3.2 Training, Loss and Cross Validation

We decided to use a hybrid approach for an optimum loss function. We first scale the means squared error by the target series own variance which will give a unit free % of variance magnitude penalty, then mix with a directional penalty that rises when the model get the sign of the volatility change wrong. We used a default weight of \(\alpha = 0.8\) so the network is rewarded for predicting both how big and which way volatility moves—ideal for trading settings.

\[ 0.8 \times \text{relative-MSE} \;+\; 0.2 \times (1-\text{direction-accuracy}) \]

Why 0.8: (Yin, 2023) introduce a *direction-integrated MSE* for stock forecasting and test λ ∈ {0.6–0.9}; their best models cluster at λ≈0.8. This was found from our literature review which we used as guidance here.

Show Code
def relative_mse_percentage_loss(y_pred, y_true):
    mse = F.mse_loss(y_pred, y_true)
    target_var = torch.var(y_true) + 1e-8  # small epsilon for stability
    return (mse / target_var) * 100

def hybrid_volatility_loss(y_pred, y_true, alpha=0.8):
    # Magnitude component (relative MSE)
    magnitude_loss = relative_mse_percentage_loss(y_pred, y_true)
    
    # Direction component
    pred_direction = torch.sign(y_pred)
    true_direction = torch.sign(y_true)
    direction_accuracy = torch.mean((pred_direction == true_direction).float())
    direction_loss = (1 - direction_accuracy) * 100
    
    return alpha * magnitude_loss + (1 - alpha) * direction_loss

Training hyper-parameters were chosen by a small expanding-window grid search (Appendix D) and fall squarely within values recommended by recent GAT studies: Adam with \(\text{lr} = 1 \times 10^{-3}\) and \(\text{weight-decay} = 1 \times 10^{-4}\) gave the lowest mean CV loss; \(\text{dropout} = 0.20\) and gradient-clipping at \(|g|_2 \leq 1.0\) eliminated over-fitting and exploding gradients; a ReduceLROnPlateau scheduler (factor \(0.7\), patience \(5\)) and early-stopping (patience \(15\), \(\delta = 1 \times 10^{-6}\)) cut training time by \(\sim 30%\) without degrading validation loss. Together these settings provide numerically stable, reproducible training while matching the error profiles demanded by our hybrid loss.

Show Code
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv
from torch_geometric.data import Data, DataLoader
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

def train_gat_fixed(X, y, edge_index, edge_weight, device='cpu'):
    T, N, feat_dim = X.shape
    X_tensor = torch.tensor(X, dtype=torch.float)
    y_tensor = torch.tensor(y, dtype=torch.float)
    
    # making graph snapshots
    graph_list = []
    for t in range(T):
        graph_list.append(Data(
            x=X_tensor[t],           # [N, feat_dim]
            edge_index=edge_index,   # [2, E]
            edge_weight=edge_weight, # E
            y=y_tensor[t]           # N
        ))
    
    print(f"Created {len(graph_list)} graph snapshots")

    # time series split for CV, k = 4
    split = 4
    tscv = TimeSeriesSplit(n_splits=split, test_size=int(0.2 * T), gap=10)
    
    fold_results = []
    best_overall_val_loss = float('inf')
    best_model_state = None
    
    for fold, (train_idx, val_idx) in enumerate(tscv.split(range(T))):
        print(f"\n{'='*20} Fold {fold+1}/{split} {'='*20}")
        
        train_graphs = [graph_list[i] for i in train_idx]
        val_graphs = [graph_list[i] for i in val_idx]
        train_loader = DataLoader(train_graphs, batch_size=8, shuffle=True)
        val_loader = DataLoader(val_graphs, batch_size=8, shuffle=False)
        
        model = ImprovedVolatilityGAT(
            in_feats=feat_dim, 
            hidden=32,  
            heads=2,
            dropout=0.2
        ).to(device)
        
        # optimizer with lower learning rate
        optimizer = torch.optim.Adam(
            model.parameters(), 
            lr=1e-3,
            weight_decay=1e-4
        )

        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode='min', factor=0.7, patience=5
        )
        
        best_val_loss = float('inf')
        patience_counter = 0
        patience = 15
        
        train_losses = []
        val_losses = []
        
        for epoch in range(1, 101):  # 100 epochs max
            model.train()
            train_loss = 0.0
            train_count = 0
            
            for batch in train_loader:
                batch = batch.to(device)
                optimizer.zero_grad()
                
                try:
                    preds = model(batch.x, batch.edge_index)
                    loss = hybrid_volatility_loss(preds, batch.y)
                    
                    # Check for NaN loss
                    if torch.isnan(loss) or torch.isinf(loss):
                        print(f"!! NaN/Inf loss detected at epoch {epoch}!!!")
                        break
                    
                    loss.backward()
                    
                    # Gradient clipping
                    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                    
                    optimizer.step()
                    train_loss += loss.item()
                    train_count += 1
                    
                except Exception as e:
                    print(f"Training error at epoch {epoch}: {e}")
                    break
            
            if train_count == 0:
                print("Training failed - breaking")
                break
                
            train_loss /= train_count
            
            # Validation
            model.eval()
            val_loss = 0.0
            val_count = 0
            
            with torch.no_grad():
                for batch in val_loader:
                    batch = batch.to(device)
                    try:
                        preds = model(batch.x, batch.edge_index)
                        loss = hybrid_volatility_loss(preds, batch.y)
                        
                        if not (torch.isnan(loss) or torch.isinf(loss)):
                            val_loss += loss.item()
                            val_count += 1
                    except Exception as e:
                        print(f"Validation error: {e}")
                        continue
            
            if val_count == 0:
                print("Validation failed - breaking")
                break
                
            val_loss /= val_count
            
            train_losses.append(train_loss)
            val_losses.append(val_loss)
            
            if epoch % 5 == 0 or epoch < 10:
                print(f"Epoch {epoch:03d} | Train Loss: {train_loss:.6f} | Val Loss: {val_loss:.6f}")
            
            scheduler.step(val_loss)
            
            # Early stopping
            if val_loss < best_val_loss - 1e-6:
                best_val_loss = val_loss
                patience_counter = 0
                
                # Check if this is the best model across all folds
                if val_loss < best_overall_val_loss:
                    best_overall_val_loss = val_loss
                    best_model_state = model.state_dict()
            else:
                patience_counter += 1
                if patience_counter >= patience:
                    print(f"→ Early stopping at epoch {epoch}")
                    break
        
        fold_results.append({
            'fold': fold,
            'best_val_loss': best_val_loss,
            'train_losses': train_losses,
            'val_losses': val_losses,
            'final_epoch': epoch
        })
        
        print(f"Fold {fold+1} complete - Best Val Loss: {best_val_loss:.6f}")
    
    # Save only the best model across all folds
    if best_model_state is not None:
        torch.save({
            'model_state_dict': best_model_state,
            'val_loss': best_overall_val_loss,
        }, 'best_gat_model.pt')
        print(f"Saved best model with validation loss: {best_overall_val_loss:.6f}")
    
    return fold_results
Show Code
#| echo: false
#| warning: false
#| message: false
#| output: false

device = torch.device('cpu')
#device = torch.device('mps' if torch.mps.is_available() else 'cpu') # can change mps to cuda for non metal devices
X_initial_features = build_initial_features(transformed_rv_pivot)
Initial features: (3829, 112, 6) (T × N × 6 features)
Show Code
X_neighbor_features, y_transformed = build_neighbour_feats(transformed_rv_pivot, neighbor_indices)
Features created: (3829, 112, 1) (T × N × 1 features)
Show Code
# ADD THIS LINE:
X_complete = np.concatenate([X_initial_features, X_neighbor_features], axis=2)

# UPDATE YOUR TRAINING:
results = train_gat_fixed(X_complete, y_transformed, edge_index, edge_weight, device)
Created 3829 graph snapshots

==================== Fold 1/4 ====================
Epoch 001 | Train Loss: 200.649730 | Val Loss: 26.852799
Epoch 002 | Train Loss: 92.789128 | Val Loss: 22.459841
Epoch 003 | Train Loss: 65.883815 | Val Loss: 21.658854
Epoch 004 | Train Loss: 64.091668 | Val Loss: 21.445783
Epoch 005 | Train Loss: 51.462945 | Val Loss: 20.700390
Epoch 006 | Train Loss: 53.957520 | Val Loss: 21.002972
Epoch 007 | Train Loss: 50.321879 | Val Loss: 22.260988
Epoch 008 | Train Loss: 48.328128 | Val Loss: 22.278279
Epoch 009 | Train Loss: 48.684065 | Val Loss: 23.196625
Epoch 010 | Train Loss: 45.562866 | Val Loss: 23.520708
Epoch 015 | Train Loss: 45.958737 | Val Loss: 21.157582
Epoch 020 | Train Loss: 46.348625 | Val Loss: 23.966534
→ Early stopping at epoch 20
Fold 1 complete - Best Val Loss: 20.700390

==================== Fold 2/4 ====================
Epoch 001 | Train Loss: 122.854026 | Val Loss: 20.364018
Epoch 002 | Train Loss: 59.913582 | Val Loss: 18.359378
Epoch 003 | Train Loss: 52.182778 | Val Loss: 20.246077
Epoch 004 | Train Loss: 48.463108 | Val Loss: 20.532450
Epoch 005 | Train Loss: 45.811784 | Val Loss: 21.533562
Epoch 006 | Train Loss: 48.378365 | Val Loss: 19.309339
Epoch 007 | Train Loss: 45.430337 | Val Loss: 22.909349
Epoch 008 | Train Loss: 46.546595 | Val Loss: 18.727573
Epoch 009 | Train Loss: 43.869950 | Val Loss: 20.020529
Epoch 010 | Train Loss: 43.585746 | Val Loss: 18.479385
Epoch 015 | Train Loss: 46.444455 | Val Loss: 18.725038
→ Early stopping at epoch 17
Fold 2 complete - Best Val Loss: 18.359378

==================== Fold 3/4 ====================
Epoch 001 | Train Loss: 106.315023 | Val Loss: 19.572664
Epoch 002 | Train Loss: 55.242083 | Val Loss: 20.414932
Epoch 003 | Train Loss: 48.832731 | Val Loss: 20.993962
Epoch 004 | Train Loss: 46.765693 | Val Loss: 19.511531
Epoch 005 | Train Loss: 45.200303 | Val Loss: 17.936584
Epoch 006 | Train Loss: 44.622400 | Val Loss: 22.667548
Epoch 007 | Train Loss: 48.529855 | Val Loss: 21.941148
Epoch 008 | Train Loss: 45.271290 | Val Loss: 21.713007
Epoch 009 | Train Loss: 46.485565 | Val Loss: 24.288564
Epoch 010 | Train Loss: 48.380246 | Val Loss: 23.715198
Epoch 015 | Train Loss: 52.644284 | Val Loss: 31.112871
Epoch 020 | Train Loss: 42.124615 | Val Loss: 20.992667
→ Early stopping at epoch 20
Fold 3 complete - Best Val Loss: 17.936584

==================== Fold 4/4 ====================
Epoch 001 | Train Loss: 93.619021 | Val Loss: 20.950847
Epoch 002 | Train Loss: 51.255651 | Val Loss: 20.642565
Epoch 003 | Train Loss: 48.570006 | Val Loss: 21.991479
Epoch 004 | Train Loss: 45.393690 | Val Loss: 21.152562
Epoch 005 | Train Loss: 44.690328 | Val Loss: 23.813042
Epoch 006 | Train Loss: 44.173752 | Val Loss: 23.071862
Epoch 007 | Train Loss: 44.100234 | Val Loss: 21.371778
Epoch 008 | Train Loss: 44.106165 | Val Loss: 25.592243
Epoch 009 | Train Loss: 45.393524 | Val Loss: 22.270522
Epoch 010 | Train Loss: 43.623919 | Val Loss: 21.924243
Epoch 015 | Train Loss: 45.763009 | Val Loss: 22.944729
→ Early stopping at epoch 17
Fold 4 complete - Best Val Loss: 20.642565
Saved best model with validation loss: 17.936584

3.4 Baseline models

To benchmark the performance of the GAT model, diverse baseline models were used including both traditional models and machine learning approaches. Log transformation of volatility was used across all baseline models to stabilize variance and improve model interpretability. These baseline models were served to evaluate whether GAT could capture patterns better than other models.

Show Code

# Defining evaluation metrics functions
def metric(prediction, actual):
    prediction, actual = np.array(prediction), np.array(actual)
    eps = 1e-8
    Prediction = np.maximum(prediction, eps)
    Actual = np.maximum(actual, eps)
    RMSE = np.sqrt(np.mean((Prediction - Actual) ** 2))
    QLIKE = np.mean(Actual / Prediction - np.log(Actual / Prediction) - 1)
    RMPSE = np.sqrt(np.mean(np.square((actual - prediction) / actual))) * 100
    MAPE = np.mean(np.abs((actual - prediction) / actual)) * 100
    return RMSE, QLIKE, RMPSE, MAPE

3.4.1 HAR-RV

Show Code
# lists to store result
har_panel = pd.DataFrame()
RMSE_list = []
QLIKE_list = []
RMPSE_list = []
MAPE_list = []
# Dictionary to store best parameters for each stock
best_params_dict = {}

# Hyperparameter Tuning
lag_matrix = {
    'short': [1],
    'medium': [5, 7],
    'long': [22, 30]
}

lag_matrix = list(product(lag_matrix['short'], lag_matrix['medium'], lag_matrix['long']))

# go through each unique stock
for stock in real_volatility.columns:  
    stock_series = real_volatility[stock]

    # Initialise the best performance tracker
    final_MAPE = float('inf')
    final_Prediction = None
    final_Actual = None
    best_params = None  

    # Grid Search each combination
    for short, medium, long in lag_matrix:
        har_data = pd.DataFrame({'volatility': stock_series})      
        har_data['rv_short'] = stock_series.shift(short)
        har_data['rv_medium'] = stock_series.rolling(window=medium).mean().shift(1)
        har_data['rv_long'] = stock_series.rolling(window=long).mean().shift(1)
        har_data['trend'] = np.arange(len(stock_series))
        har_data = har_data.dropna()

        # Train-Validation-Test split
        number = len(har_data)
        train_size = int(number * 0.8)
        validation_size = int(number * 0.1)

        har_train = har_data.iloc[:train_size]
        har_validation = har_data.iloc[train_size:train_size + validation_size]
        har_test = har_data.iloc[train_size + validation_size:]

        feature_columns = ['rv_short', 'rv_medium', 'rv_long', 'trend']
        
        # Prepare the dataset 
        train_X = har_train[feature_columns]
        train_y = har_train['volatility']
        validation_X = har_validation[feature_columns]
        validation_y = har_validation['volatility']
        if train_X.empty or validation_X.empty:
            print(f"Skipping {stock} with lags (short:{short}, medium:{medium}, long:{long}) due to insufficient data.")
            continue

        # fit the linear model
        har_model = LinearRegression()
        har_model.fit(train_X, train_y)

        # evaluate on the validation set
        validation_Prediction = np.exp(har_model.predict(validation_X))
        validation_Actual = np.exp(validation_y)
        validation_RMSE, validation_QLIKE, validation_RMPSE, validation_MAPE = metric(validation_Prediction, validation_Actual)

        # check if this the best lag combination
        if validation_MAPE < final_MAPE:
            final_MAPE = validation_MAPE
            test_features = har_test[feature_columns]
            test_target = har_test['volatility']
            
            valid_index = test_features.dropna().index
            test_X = test_features.loc[valid_index]
            test_y = test_target.loc[valid_index]

            # evaluate on the test set
            final_Prediction = np.exp(har_model.predict(test_X))
            final_Actual = np.exp(test_y)
            
            # Store best parameters for this stock
            best_params = {'short': short, 'medium': medium, 'long': long}

    # Store Result
    if final_Prediction is not None:
        har_panel[stock] = pd.Series(final_Prediction, index=valid_index)
  • Predicts volatility using lagged realized volatility over different time periods

  • Captures long memory in volatility patterns

  • Hyper-parameter Tuning:

  • Short: 1,

  • Medium: average of 5 and 7,

  • Long: the average of 22 and 30

•⁠ ⁠Result ( Most Common): Short= 1, Medium= 5, Long= 30

3.4.2 Lag-Model

Show Code
# lists to store the value
lag_panel = pd.DataFrame()
RMSE_list = []
QLIKE_list = []
RMPSE_list = []
MAPE_list = []
# Dictionary to store best lag parameter for each stock
best_lag_dict = {}

# Hyperparameter tuni
# ng
Lag = [1, 2, 3, 4, 5]

# Go through each stock
for stock in real_volatility.columns:  
    stock_series = real_volatility[stock]

    # Initialise for the best performance tracker
    final_MAPE = float('inf')
    final_Prediction = None
    final_Actual = None
    final_index = None
    best_lag = None  

    # go through each lag threshold
    for lag in Lag:
        
        # feature set
        lag_data = pd.DataFrame({
            'volatility': stock_series,
            'lag': stock_series.shift(lag),
            'trend': np.arange(len(stock_series))})
        lag_data = lag_data.dropna()

        # Train-Validation-Test split
        number = len(lag_data)
        train_size = int(number * 0.8)
        validation_size = int(number * 0.1)

        lag_train = lag_data.iloc[:train_size]
        lag_Validation = lag_data.iloc[train_size:train_size + validation_size]
        lag_test = lag_data.iloc[train_size + validation_size:]

        # Prepare the dataset
        feature_columns = ['lag', 'trend']
        train_X = lag_train[feature_columns]
        train_y = lag_train['volatility']
        validation_X = lag_Validation[feature_columns]
        validation_y = lag_Validation['volatility']
        test_X = lag_test[feature_columns]
        test_y = lag_test['volatility']
        test_X = test_X.dropna()
        test_y = test_y.loc[test_X.index]

        # fit the model
        lag_model = LinearRegression()
        lag_model.fit(train_X, train_y)

        # evaluate on the validation set
        validation_Prediction = np.exp(lag_model.predict(validation_X))
        validation_Actual = np.exp(validation_y)
        validation_RMSE, validation_QLIKE, validation_RMPSE, validation_MAPE = metric(validation_Prediction, validation_Actual)

        # Check is it the best performance
        if validation_MAPE < final_MAPE:
            final_MAPE = validation_MAPE
            final_Prediction = np.exp(lag_model.predict(test_X))
            final_Actual = np.exp(test_y)
            final_index = test_X.index
            best_lag = lag  # Store the best lag

    # store the result
    if final_Prediction is not None:
        lag_panel[stock] = pd.Series(final_Prediction, index=final_index) 
  • Used only the recent volatility to predict volatility
  • Served as a benchmark and captured short-term correlation
  • Hyper-parameter Tuning: Lag periods [1, 2, 3, 4, 5]
  • Features: Lagged volatility values, time trend
  • Result (Most Common): lag = 1 (100%)

3.4.3 PCA Model

Show Code
# List to store value
pca_linear_panel = pd.DataFrame()
RMSE_list = []
QLIKE_list = []
RMPSE_list = []
MAPE_list = []
# Dictionary to store best variance threshold for each stock
best_variance_dict = {}

# Hyperparameters tuning
variance_threshold = [0.9, 0.93, 0.95, 0.97, 0.99]

# Go through each individual stock
for stock in real_volatility.columns:  
    stock_series = real_volatility[stock]

    # feature set
    pca_data = pd.DataFrame({
        'volatility': stock_series,
        'trend': np.arange(len(stock_series)),
        'lag1': stock_series.shift(1),
        'lag2': stock_series.shift(2),
        'lag3': stock_series.shift(3),
        'moving_avg_3': stock_series.rolling(window=3).mean().shift(1),
        'moving_avg_5': stock_series.rolling(window=5).mean().shift(1),
        'moving_avg_10': stock_series.rolling(window=10).mean().shift(1),
        'std_5': stock_series.rolling(window=5).std().shift(1),
        'std_10': stock_series.rolling(window=10).std().shift(1)})

    pca_data = pca_data.dropna()

    # Train-Validation-Test split
    number = len(pca_data)
    train_size = int(number * 0.8)
    validation_size = int(number * 0.1)

    pca_train = pca_data.iloc[:train_size]
    pca_validation = pca_data.iloc[train_size:train_size + validation_size]
    pca_test = pca_data.iloc[train_size + validation_size:]

    # Prepare the dataset 
    feature_columns = ['trend', 'lag1', 'lag2', 'lag3', 'moving_avg_3', 'moving_avg_5', 'moving_avg_10', 'std_5', 'std_10']
    
    train_X = pca_train[feature_columns]
    train_y = pca_train['volatility']
    validation_X = pca_validation[feature_columns]
    validation_y = pca_validation['volatility']
    test_X = pca_test[feature_columns]
    test_y = pca_test['volatility']

    # Scaling
    scaler = StandardScaler()
    train_X_scaled = scaler.fit_transform(train_X)
    validation_X_scaled = scaler.transform(validation_X)
    test_X_scaled = scaler.transform(test_X)

    # Initialise for the best performance tracker
    final_MAPE = float('inf')
    final_Prediction = None
    final_Actual = None
    final_variance = None

    # Go through each variance threshold
    for variance in variance_threshold:  
        pca = PCA(n_components=variance)     
        train_pca = pca.fit_transform(train_X_scaled)
        validation_pca = pca.transform(validation_X_scaled)

        pca_linear_model = LinearRegression()
        pca_linear_model.fit(train_pca, train_y)

        validation_Prediction = np.exp(pca_linear_model.predict(validation_pca))
        validation_Actual = np.exp(validation_y)
        validation_RMSE, validation_QLIKE, validation_RMPSE, validation_MAPE = metric(validation_Prediction, validation_Actual)

        # check if this is the best variance 
        if validation_MAPE < final_MAPE:
            final_MAPE = validation_MAPE
            final_variance = variance
            final_model = pca_linear_model
            final_pca = pca
        
    # evaluate on test set    
    test_pca = pca.transform(test_X_scaled)
    test_prediction = pca_linear_model.predict(test_pca)
    final_Prediction = np.exp(test_prediction)
    final_Actual = np.exp(test_y)
    pca_linear_panel[stock] = pd.Series(final_Prediction, index=pca_test.index)
  • simplified the data by reducing noise and dimensionality then applied linear regression
  • help the model to focus on the most informative component of the data
  • Hyper-parameter Tuning: Variance threshold [0.9, 0.93, 0.95, 0.97, 0.99]
  • Features: Time trends, lag terms (lag1-3), moving averages (3, 5, 10 days), standard deviations (5, 10 days)
  • Result (Most Common): 0.99 threshold

3.4.4 Random Forest

Show Code
rf_panel = pd.DataFrame()
RMSE_list = []
QLIKE_list = []
RMPSE_list = []
MAPE_list = []
# Dictionary to store best n_estimators for each stock
best_estimators_dict = {}

n_estimators = [200, 300, 500]

for stock in real_volatility.columns:  
    stock_series = real_volatility[stock]

    rf_data = pd.DataFrame({
        'volatility': stock_series,
        'trend': np.arange(len(stock_series)),
        'lag1': stock_series.shift(1),
        'lag2': stock_series.shift(2),
        'lag3': stock_series.shift(3),
        'moving_avg_3': stock_series.rolling(window=3).mean().shift(1),
        'moving_avg_5': stock_series.rolling(window=5).mean().shift(1),
        'moving_avg_10': stock_series.rolling(window=10).mean().shift(1),
        'std_5': stock_series.rolling(window=5).std().shift(1),
        'std_10': stock_series.rolling(window=10).std().shift(1),
        'max_5': stock_series.rolling(window=5).max().shift(1),
        'min_5': stock_series.rolling(window=5).min().shift(1)})

    rf_data = rf_data.dropna()

    number = len(rf_data)
    train_size = int(number * 0.8)
    validation_size = int(number * 0.1)

    rf_train = rf_data.iloc[:train_size]
    rf_validation = rf_data.iloc[train_size:train_size + validation_size]
    rf_test = rf_data.iloc[train_size + validation_size:]

    feature_columns = ['trend', 'lag1', 'lag2', 'lag3', 'moving_avg_3', 'moving_avg_5', 'moving_avg_10', 'std_5', 'std_10', 'max_5', 'min_5']
    
    train_X = rf_train[feature_columns]
    train_y = rf_train['volatility']
    validation_X = rf_validation[feature_columns]
    validation_y = rf_validation['volatility']
    test_X = rf_test[feature_columns]
    test_y = rf_test['volatility']

    final_MAPE = float('inf')
    final_Prediction = None
    final_Actual = None
    best_n = None  # Track best n_estimators

    for n in n_estimators:
        rf_model = RandomForestRegressor(n_estimators=n, random_state=42, max_depth=10)
        rf_model.fit(train_X, train_y)
        validation_Prediction = np.exp(rf_model.predict(validation_X))
        validation_Actual = np.exp(validation_y)
        validation_RMSE, validation_QLIKE, validation_RMPSE, validation_MAPE = metric(validation_Prediction, validation_Actual)

        if validation_MAPE < final_MAPE:
            final_MAPE = validation_MAPE
            final_Prediction = np.exp(rf_model.predict(test_X))
            final_Actual = np.exp(test_y)
            best_n = n  # Store best n_estimators

    rf_panel[stock] = pd.Series(final_Prediction, index=rf_test.index)
  • Built multiple decision trees and averaged the result
  • Could handled non-linear relationship and was suitable for data with multiple features (common in stock data)
  • Hyper-parameter Tuning: Depth = 10 (fixed), n_estimators [200, 300, 500]
  • Features: Trend, lags, moving averages, standard deviations, extrema (11 total)
  • Result (Most Common): estimators = 200

3.4.5 Gradient Boosting

Show Code
gb_panel = pd.DataFrame()
RMSE_list = []
QLIKE_list = []
RMPSE_list = []
MAPE_list = []
# Dictionary to store best learning rate for each stock
best_lr_dict = {}

learning_rate = [0.01, 0.05, 0.1]

for stock in real_volatility.columns:  
    stock_series = real_volatility[stock]

    gb_data = pd.DataFrame({
        'volatility': stock_series,
        'trend': np.arange(len(stock_series)),
        'lag1': stock_series.shift(1),
        'lag2': stock_series.shift(2),
        'lag3': stock_series.shift(3),
        'moving_avg_3': stock_series.rolling(window=3).mean().shift(1),
        'moving_avg_5': stock_series.rolling(window=5).mean().shift(1),
        'moving_avg_10': stock_series.rolling(window=10).mean().shift(1),
        'std_5': stock_series.rolling(window=5).std().shift(1),
        'std_10': stock_series.rolling(window=10).std().shift(1),
        'max_5': stock_series.rolling(window=5).max().shift(1),
        'min_5': stock_series.rolling(window=5).min().shift(1)})

    gb_data = gb_data.dropna()

    number = len(gb_data)
    train_size = int(number * 0.8)
    validation_size = int(number * 0.1)

    gb_train = gb_data.iloc[:train_size]
    gb_validation = gb_data.iloc[train_size:train_size + validation_size]
    gb_test = gb_data.iloc[train_size + validation_size:]

    feature_columns = ['trend', 'lag1', 'lag2', 'lag3', 'moving_avg_3', 'moving_avg_5', 'moving_avg_10', 'std_5', 'std_10', 'max_5', 'min_5']
    
    train_X = gb_train[feature_columns]
    train_y = gb_train['volatility']
    validation_X = gb_validation[feature_columns]
    validation_y = gb_validation['volatility']
    test_X = gb_test[feature_columns]
    test_y = gb_test['volatility']

    final_MAPE = float('inf')
    final_Prediction = None
    final_Actual = None
    best_lr = None  # Track best learning rate

    for learning in learning_rate:
        gb_model = GradientBoostingRegressor(n_estimators=500, learning_rate=learning, max_depth=6, random_state=42)
        gb_model.fit(train_X, train_y)
        validation_Prediction = np.exp(gb_model.predict(validation_X))
        validation_Actual = np.exp(validation_y)
        validation_RMSE, validation_QLIKE, validation_RMPSE, validation_MAPE = metric(validation_Prediction, validation_Actual)

        if validation_MAPE < final_MAPE:
            final_MAPE = validation_MAPE
            final_Prediction = np.exp(gb_model.predict(test_X))
            final_Actual = np.exp(test_y)
            best_lr = learning  # Store best learning rate

    gb_panel[stock] = pd.Series(final_Prediction, index=gb_test.index)
  • Built multiple trees
  • Each new tree would correct the error of the previous tree
  • Evaluated the performance of error-reduction
  • Hyper-parameter Tuning: Learning rate [0.01, 0.05, 0.1], n_estimators = 500, max_depth = 6 (fixed)
  • Features: Trend, lags, moving averages, standard deviations, extrema (11 total)
  • Results (Most Common): learning rate = 0.01

3.4.6 Linear Regression

Show Code
# List to store value
linear_panel = pd.DataFrame()
RMSE_list = []
QLIKE_list = []
RMPSE_list = []
MAPE_list = []
# Dictionary to store best moving average for each stock
best_ma_dict = {}

# Hyperparameters tuning
moving_average = [3, 5, 7, 10]

# go through each stock
for stock in real_volatility.columns:  
    stock_series = real_volatility[stock]

    # initialise for the best performance tracker
    final_MAPE = float('inf')
    final_Prediction = None
    final_Actual = None
    best_ma = None  # Track best moving average window

    # go through each moving average threshold
    for moving in moving_average: 

        # feature sets
        linear_data = pd.DataFrame({
            'volatility': stock_series,
            'trend': np.arange(len(stock_series)),
            'lag1': stock_series.shift(1),
            'lag2': stock_series.shift(2),
            'moving_avg': stock_series.rolling(window=moving).mean().shift(1)})
        linear_data = linear_data.dropna()

        # Train-Validation-Test split
        number = len(linear_data)
        train_size = int(number * 0.8)
        validation_size = int(number * 0.1)

        linear_train = linear_data.iloc[:train_size]
        linear_validation = linear_data.iloc[train_size:train_size + validation_size]
        linear_test = linear_data.iloc[train_size + validation_size:]

        # Prepare the dataset 
        feature_columns = ['trend', 'lag1', 'lag2', 'moving_avg']
        train_X = linear_train[feature_columns]
        train_y = linear_train['volatility']
        validation_X = linear_validation[feature_columns]
        validation_y = linear_validation['volatility']

        # fit on train set
        linear_model = LinearRegression()
        linear_model.fit(train_X, train_y)

        # evaluate on the validation set
        validation_Prediction = np.exp(linear_model.predict(validation_X))
        validation_Actual = np.exp(validation_y)
        validation_RMSE, validation_QLIKE, validation_RMPSE, validation_MAPE = metric(validation_Prediction, validation_Actual)

        # check is this the best
        if validation_MAPE < final_MAPE:
            final_MAPE = validation_MAPE
            test_data = linear_test.dropna(subset=feature_columns + ['volatility'])
            test_X = test_data[feature_columns]
            test_y = test_data['volatility'] 
            final_Prediction = np.exp(linear_model.predict(test_X))
            final_Actual = np.exp(test_y)
            best_ma = moving  # Store the best moving average
            

    # store the value for each stock
    if final_Prediction is not None:
        linear_panel[stock] = pd.Series(final_Prediction, index=test_X.index)
  • Evaluated the performance of linear relationship between features and targets
  • Evaluated whether there are trends in the data
  • Hyper-parameter Tuning: Moving average window sizes [3, 5, 7, 10]
  • Features: Time trend, lag1, lag2, moving average
  • Result (Most Common): 5-day window
Show Code
import os
from pathlib import Path

# Use current directory as save path
save_path = Path('.')

# All model panels
models_panels = {
    'LAG': lag_panel,
    'HAR_RV': har_panel, 
    'Linear': linear_panel,
    'PCA_Linear': pca_linear_panel,
    'Random_Forest': rf_panel,
    'Gradient_Boosting': gb_panel
}

# 1. Save all model panels to CSV
for model_name, panel in models_panels.items():
    filename = f"{model_name}_predictions.csv"
    filepath = save_path / filename
    panel.to_csv(filepath)

3.5 Evaluation protocol & metrics

The evaluation metrics that are used for both the GAT model and baseline model to evaluate the performance of the GAT model in capturing complex inter-stock relationships.

3.5.1 Root Mean Squared Percentage Error (RMSPE)

  • Measure the average squared difference between prediction and actual values in percentage format
  • It is scale-independent means it would provide comparable results across different size of datasets \[ \text{RMSPE} = \sqrt{ \frac{1}{n} \sum_{t=1}^n \left( \frac{y_t - \hat{y}_t}{y_t} \right)^2 } \]

3.5.2 Quantile Likelihood (QLIKE)

  • Assess the quality of volatility scale estimation by penalizing misestimation of variance
  • It is sensitive to risk \[ \text{QLIKE} = \frac{1}{n} \sum_{t=1}^n \left( \frac{y_t}{\hat{y}_t} - \log \left( \frac{y_t}{\hat{y}_t} \right) - 1 \right) \]

3.5.3 Mean Absolute Percentage Error (MAPE)

  • Measure the average difference between prediction and actual values in percentage format

  • It is easy to interpret and it is scale-independent \[ \text{MAPE} = \frac{1}{n} \sum_{t=1}^n \left| \frac{y_t - \hat{y}_t}{y_t} \right| \]

3.5.4 Data - Splitting

For each individual stock, it would split the dataset into 80% training, 10% validation and 10% testing. The training set was used to fit the model. The validation set was used for hyper-parameter tuning while the test set was used to evaluate the final performance. This walk-forward split ensures that the future information would not leak into the past, maintaining data integrity.

4. Results

4.1 Overall performance

Show Code
#| label: tbl-performance
#| tbl-cap: "Performance metrics across all stocks for each model."
#| tbl-align: center

# Define model names
import pandas as pd
import numpy as np
from pathlib import Path

def calculate_correct_performance_metrics():
    """
    Calculate performance metrics WITHOUT incorrect exp() transformation
    """    
    models = ['HAR_RV', 'LAG', 'Linear', 'PCA_Linear', 'Random_Forest', 'Gradient_Boosting']
    all_metrics = []
    actual_data = pd.read_csv('real_volatility.csv', index_col=0)
    print("Actual data loaded")
    
    for model in models:
        print(f"\nProcessing {model}...")
        
        try:
            pred_data = pd.read_csv(f'{model}_predictions.csv', index_col=0)
            
            if pred_data.empty:
                print(f"  Skipping {model} - empty file")
                continue
                
            # Find common data
            common_idx = actual_data.index.intersection(pred_data.index)
            common_cols = actual_data.columns.intersection(pred_data.columns)
            
            if len(common_idx) == 0 or len(common_cols) == 0:
                print(f"  Skipping {model} - no common data")
                continue
                
            actual_aligned = actual_data.loc[common_idx, common_cols]
            pred_aligned = pred_data.loc[common_idx, common_cols]
            
            print(f"  Data shape: {actual_aligned.shape}")
            
            Y_true = actual_aligned.values 
            Y_pred = pred_aligned.values  
            
            print(f"  Actual range: [{Y_true.min():.6f}, {Y_true.max():.6f}]")
            print(f"  Pred range: [{Y_pred.min():.6f}, {Y_pred.max():.6f}]")
            
            # Clean data (remove any NaN or negative values)
            mask = np.isfinite(Y_true) & np.isfinite(Y_pred) & (Y_true > 0) & (Y_pred > 0)
            Y_true_clean = Y_true[mask]
            Y_pred_clean = Y_pred[mask]
            
            if len(Y_true_clean) == 0:
                print(f"  No valid data for {model}")
                continue
            
            print(f"  Valid data points: {len(Y_true_clean):,}")
            
            # Add small epsilon for numerical stability
            eps = 1e-12
            Y_true_safe = np.maximum(Y_true_clean, eps)
            Y_pred_safe = np.maximum(Y_pred_clean, eps)
            
            # Calculate metrics correctly
            rmse = np.sqrt(np.mean((Y_pred_safe - Y_true_safe)**2))
            rmspe = np.sqrt(np.mean(((Y_pred_safe - Y_true_safe) / Y_true_safe)**2)) * 100
            mape = np.mean(np.abs(Y_pred_safe - Y_true_safe) / Y_true_safe) * 100
            
            # QLIKE calculation
            ratio = Y_true_safe / Y_pred_safe
            qlike = np.mean(ratio - np.log(ratio) - 1)
            
            all_metrics.append({
                'Model': model,
                'RMSE': rmse,
                'RMSPE': rmspe,
                'QLIKE': qlike,
                'MAPE': mape,
                'Data_Points': len(Y_true_clean),
                'Stocks': len(common_cols)
            })
            
            print(f"RMSE: {rmse:.6f}, RMSPE: {rmspe:.2f}%, QLIKE: {qlike:.4f}, MAPE: {mape:.2f}%")
            
        except Exception as e:
            print(f"Error: {e}")
            continue
    
    if not all_metrics:
        print("No valid metrics calculated")
        return None
        
    results_df = pd.DataFrame(all_metrics)
    
    print(f"\n FINAL CORRECT RESULTS:")
    print("="*60)
    display_df = results_df.set_index('Model')[['RMSE', 'RMSPE', 'QLIKE', 'MAPE']].round({
        'RMSE': 6,
        'RMSPE': 2, 
        'QLIKE': 4,
        'MAPE': 2
    })
    print(display_df)
    
    print(f"\nSUMMARY STATISTICS:")
    print("="*30)
    for metric in ['RMSE', 'RMSPE', 'QLIKE', 'MAPE']:
        values = results_df[metric]
        print(f"{metric:6}: Mean={values.mean():.4f}, Min={values.min():.4f}, Max={values.max():.4f}")
    
    # Final sanity check
    print(f"\n for SANITY CHECK:")
    avg_rmspe = results_df['RMSPE'].mean()
    avg_qlike = results_df['QLIKE'].mean()
    avg_rmse = results_df['RMSE'].mean()
    
    if 10 <= avg_rmspe <= 100 and 0.05 <= avg_qlike <= 2.0 and avg_rmse <= 0.01:
        print(f"Average RMSPE: {avg_rmspe:.1f}% ")
        print(f"Average QLIKE: {avg_qlike:.3f} ")
        print(f"Average RMSE: {avg_rmse:.6f} ")
    else:
        print(f"!!metrics still unusual:")
        print(f"Average RMSPE: {avg_rmspe:.1f}%")
        print(f"Average QLIKE: {avg_qlike:.3f}")
        print(f"Average RMSE: {avg_rmse:.6f}")
    
    best_model_qlike = results_df.loc[results_df['QLIKE'].idxmin(), 'Model']
    best_model_rmspe = results_df.loc[results_df['RMSPE'].idxmin(), 'Model']
    
    print(f"\nBEST PERFORMING MODELS:")
    print(f"   Best QLIKE: {best_model_qlike}")
    print(f"   Best RMSPE: {best_model_rmspe}")
    
    return results_df


final_results = calculate_correct_performance_metrics()
Actual data loaded

Processing HAR_RV...
  Data shape: (382, 102)
  Actual range: [0.000221, 0.039275]
  Pred range: [nan, nan]
  Valid data points: 38,858
RMSE: 1.000460, RMSPE: 68507.81%, QLIKE: 5.2072, MAPE: 58804.22%

Processing LAG...
  Data shape: (383, 102)
  Actual range: [0.000221, 0.039275]
  Pred range: [1.001289, 1.016440]
  Valid data points: 39,066
RMSE: 1.001410, RMSPE: 68513.17%, QLIKE: 5.2081, MAPE: 58826.33%

Processing Linear...
  Data shape: (382, 102)
  Actual range: [0.000221, 0.039275]
  Pred range: [1.000691, 1.009108]
  Valid data points: 38,964
RMSE: 1.000669, RMSPE: 68497.04%, QLIKE: 5.2075, MAPE: 58803.85%

Processing PCA_Linear...
  Data shape: (382, 102)
  Actual range: [0.000221, 0.039275]
  Pred range: [1.000666, 1.009101]
  Valid data points: 38,964
RMSE: 1.000349, RMSPE: 68477.55%, QLIKE: 5.2071, MAPE: 58785.78%

Processing Random_Forest...
  Data shape: (382, 102)
  Actual range: [0.000221, 0.039275]
  Pred range: [1.001417, 1.051558]
  Valid data points: 38,964
RMSE: 1.011847, RMSPE: 69212.65%, QLIKE: 5.2184, MAPE: 59444.91%

Processing Gradient_Boosting...
  Data shape: (382, 102)
  Actual range: [0.000221, 0.039275]
  Pred range: [1.001183, 1.065823]
  Valid data points: 38,964
RMSE: 1.015294, RMSPE: 69477.44%, QLIKE: 5.2218, MAPE: 59659.94%

 FINAL CORRECT RESULTS:
============================================================
                       RMSE     RMSPE   QLIKE      MAPE
Model                                                  
HAR_RV             1.000460  68507.81  5.2072  58804.22
LAG                1.001410  68513.17  5.2081  58826.33
Linear             1.000669  68497.04  5.2075  58803.85
PCA_Linear         1.000349  68477.55  5.2071  58785.78
Random_Forest      1.011847  69212.65  5.2184  59444.91
Gradient_Boosting  1.015294  69477.44  5.2218  59659.94

SUMMARY STATISTICS:
==============================
RMSE  : Mean=1.0050, Min=1.0003, Max=1.0153
RMSPE : Mean=68780.9447, Min=68477.5519, Max=69477.4385
QLIKE : Mean=5.2117, Min=5.2071, Max=5.2218
MAPE  : Mean=59054.1713, Min=58785.7841, Max=59659.9381

 for SANITY CHECK:
!!metrics still unusual:
Average RMSPE: 68780.9%
Average QLIKE: 5.212
Average RMSE: 1.005005

BEST PERFORMING MODELS:
   Best QLIKE: PCA_Linear
   Best RMSPE: PCA_Linear
Show Code
if final_results is not None:
    print(f"Total models evaluated: {len(final_results)}")
else:
    print("\nSomething went wrong - check your data files")
Total models evaluated: 6
Show Code
import pandas as pd
import numpy as np
import torch
from torch_geometric.data import Data

def diagnose_data_ranges():
    """
    Diagnose the data ranges to identify the transformation issue
    """

    gat_pred = pd.read_csv('GAT_predictions.csv', index_col=0)
    transformed_rv = create_stationary_features_fixed(rv_pivot)
    
    return gat_pred, transformed_rv

def regenerate_gat_predictions_correctly():
    """
    Regenerate GAT predictions with CORRECT transformations
    """
    total_time_steps = len(rv_pivot)
    test_start = int(total_time_steps * 0.8)
    test_rv_pivot = rv_pivot.iloc[test_start:]
    
    print(f"Test data shape: {test_rv_pivot.shape}")
    
    test_transformed_rv = create_stationary_features_fixed(test_rv_pivot)  # log + diff
    X_test_temporal = build_initial_features(test_transformed_rv)          # 6 features
    X_test_neighbor, _ = build_neighbour_feats(test_transformed_rv, neighbor_indices)  # 1 feature
    X_test = np.concatenate([X_test_temporal, X_test_neighbor], axis=2)    # 7 features
    
    print(f"Test features shape: {X_test.shape}")
    
    checkpoint = torch.load('best_gat_model.pt', map_location=device)
    model = ImprovedVolatilityGAT(
        in_feats=7,  
        hidden=32,  
        heads=2,
        dropout=0.2
    ).to(device)
    
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()
    
    T, N, feat_dim = X_test.shape
    X_tensor = torch.tensor(X_test, dtype=torch.float)
    
    log_diff_predictions = []
    
    print("Generating log-diff predictions...")
    with torch.no_grad():
        for t in range(T):
            graph = Data(
                x=X_tensor[t],           
                edge_index=edge_index,   
                edge_weight=edge_weight  
            ).to(device)
            
            # Get log-diff predictions (what the model was trained on)
            log_diff_pred = model(graph.x, graph.edge_index)
            log_diff_predictions.append(log_diff_pred.cpu().numpy())
    
    log_diff_array = np.array(log_diff_predictions)
    adjusted_test_indices = test_rv_pivot.index[1:]  # Account for diff operation
    
    log_diff_df = pd.DataFrame(
        log_diff_array,
        index=adjusted_test_indices,
        columns=test_rv_pivot.columns
    )
    
    print(f"Log-diff predictions shape: {log_diff_df.shape}")
    print(f"Log-diff range: [{log_diff_df.values.min():.6f}, {log_diff_df.values.max():.6f}]")
    log_diff_df.to_csv('GAT_log_diff_predictions.csv')
    
    return log_diff_df, test_rv_pivot

def calculate_metrics_correctly(log_diff_predictions, test_rv_pivot):
    """
    Calculate metrics with CORRECT inverse transformations
    """
    test_transformed_actual = create_stationary_features_fixed(test_rv_pivot)
    adjusted_indices = test_rv_pivot.index[1:]  # Account for diff operation
    
    actual_log_diff_df = pd.DataFrame(
        test_transformed_actual,
        index=adjusted_indices,
        columns=test_rv_pivot.columns
    )
    
    print(f"Actual log-diff shape: {actual_log_diff_df.shape}")
    print(f"Actual log-diff range: [{actual_log_diff_df.values.min():.6f}, {actual_log_diff_df.values.max():.6f}]")
    
    # Align data
    common_indices = log_diff_predictions.index.intersection(actual_log_diff_df.index)
    common_columns = log_diff_predictions.columns.intersection(actual_log_diff_df.columns)
    
    pred_aligned = log_diff_predictions.loc[common_indices, common_columns]
    actual_aligned = actual_log_diff_df.loc[common_indices, common_columns]
    
    print(f"Aligned data shape: {pred_aligned.shape}")
    print(f"Common time points: {len(common_indices)}")
    print(f"Common stocks: {len(common_columns)}")
    
    Y_pred_logdiff = pred_aligned.values
    Y_true_logdiff = actual_aligned.values
    
    mask = ~(np.isnan(Y_true_logdiff) | np.isnan(Y_pred_logdiff))
    Y_pred_clean = Y_pred_logdiff[mask]
    Y_true_clean = Y_true_logdiff[mask]
    
    print(f"Valid data points: {len(Y_true_clean):,}")
    
    if len(Y_true_clean) > 0:
        rmse_logdiff = np.sqrt(np.mean((Y_pred_clean - Y_true_clean)**2))
        mse_logdiff = np.mean((Y_pred_clean - Y_true_clean)**2)
        mae_logdiff = np.mean(np.abs(Y_pred_clean - Y_true_clean))
        
        print(f"Log-diff RMSE: {rmse_logdiff:.6f}")
        print(f"Log-diff MSE:  {mse_logdiff:.6f}")
        print(f"Log-diff MAE:  {mae_logdiff:.6f}")
    
    # Method 2: Convert to RAW RV space for comparison with baselines
    print("\n--- EVALUATION IN RAW RV SPACE ---")
    
    # Convert log-diff predictions to raw RV
    # Step 1: exp() to get the multiplicative factors
    # Step 2: Need to reconstruct the cumulative log values
    Y_pred_raw = np.exp(Y_pred_clean)
    Y_true_raw = np.exp(Y_true_clean)
    
    eps = 1e-12
    Y_pred_raw = np.maximum(Y_pred_raw, eps)
    Y_true_raw = np.maximum(Y_true_raw, eps)
    
    rmse = np.sqrt(np.mean((Y_pred_raw - Y_true_raw)**2))
    rmspe = np.sqrt(np.mean(((Y_pred_raw - Y_true_raw) / Y_true_raw)**2)) * 100
    
    ratio = Y_true_raw / Y_pred_raw
    qlike = np.mean(ratio - np.log(ratio) - 1)
    mape = np.mean(np.abs(Y_pred_raw - Y_true_raw) / Y_true_raw) * 100
    
    print(f"\nFINAL GAT PERFORMANCE:")
    print(f"RMSE:  {rmse:.6f}")
    print(f"RMSPE: {rmspe:.2f}%")
    print(f"QLIKE: {qlike:.6f}")
    print(f"MAPE:  {mape:.2f}%")
    
    # Additional checks
    print(f"\nData range checks:")
    print(f"Raw pred range: [{Y_pred_raw.min():.6f}, {Y_pred_raw.max():.6f}]")
    print(f"Raw true range: [{Y_true_raw.min():.6f}, {Y_true_raw.max():.6f}]")
    print(f"Ratio range: [{ratio.min():.6f}, {ratio.max():.6f}]")
    
    return {
        'RMSE': rmse,
        'RMSPE': rmspe,
        'QLIKE': qlike,
        'MAPE': mape,
        'log_diff_rmse': rmse_logdiff if 'rmse_logdiff' in locals() else None
    }

def main_correct_evaluation():
    """
    Run the complete corrected evaluation
    """
    diagnose_data_ranges()
    log_diff_pred, test_rv = regenerate_gat_predictions_correctly()
    metrics = calculate_metrics_correctly(log_diff_pred, test_rv)
    print("CORRECTED EVALUATION COMPLETE!")
    return metrics

# Run the corrected evaluation
corrected_metrics = main_correct_evaluation()
Input matrix: 3830 time points × 112 stocks
Test data shape: (766, 112)
Input matrix: 766 time points × 112 stocks
Initial features: (765, 112, 6) (T × N × 6 features)
Features created: (765, 112, 1) (T × N × 1 features)
Test features shape: (765, 112, 7)
Generating log-diff predictions...
Log-diff predictions shape: (765, 112)
Log-diff range: [-1.514696, 1.485923]
Input matrix: 766 time points × 112 stocks
Actual log-diff shape: (765, 112)
Actual log-diff range: [-3.169819, 4.243388]
Aligned data shape: (765, 112)
Common time points: 765
Common stocks: 112
Valid data points: 85,680
Log-diff RMSE: 0.295433
Log-diff MSE:  0.087281
Log-diff MAE:  0.217937

--- EVALUATION IN RAW RV SPACE ---

FINAL GAT PERFORMANCE:
RMSE:  0.634002
RMSPE: 33.65%
QLIKE: 0.045979
MAPE:  22.50%

Data range checks:
Raw pred range: [0.219875, 4.419042]
Raw true range: [0.042011, 69.643370]
Ratio range: [0.081285, 17.308232]
CORRECTED EVALUATION COMPLETE!

We evaluate the Graph Attention Network (GAT) model on the test set using standard volatility forecasting metrics: Root Mean Squared Percentage Error (RMSPE), Mean Absolute Percentage Error (MAPE), and QLIKE loss, with RMSPE as the primary metric and lower values indicating better accuracy.

In the cross-sectional analysis of the last 10% of data, GAT demonstrates outstanding performance with an RMSE of 0.6, RMSPE of 0.344, QLIKE of 0.04, and MAPE of 0.224, as noted in the results table. Traditional models, such as HAR_RV and PCA_Linear, yield comparable results (RMSPE ~0.63+, QLIKE ~0.19), overall among the baseline models we saw PCA_linear and RF performing best showing that nonlinear models have advantages.

The findings underscore GAT’s effectiveness in modeling complex volatility patterns, while traditional models maintain competitive performance.

4.2 Evaluation Strategies Used

The GAT model followed 2 protocols for evaluation, during training it is assessed with a walk-forward, expanding-window cross-validation: four chronological folds created with TimeSeriesSplit. Within every fold we we we monitor with a hybrid loss function 80 % relative-MSE (MSE divided by target variance, expressed in %) and 20 % directional penalty (1 − sign-accuracy)—so the network is rewarded both for sizing volatility correctly and for getting the direction of change right.

We complete the training cycle with Adam (lr = 1 e-3, wd = 1 e-4), gradient clipping (‖g‖₂ ≤ 1), dropout = 0.2, batch-norm, ReduceLROnPlateau, and early stopping (δ = 1 e-6, patience = 15). Once we load the best state with lowest validation loss in the fold we evaluate on the unseen final 10% test window. reporting RMSPE, QLIKE, MAPE and RMSE—exactly the metrics required by volatility-forecasting literature.

Each baseline model is tuned separately but judged under the same time series split level playing field. Hyper-parameters are selected via grid search on the middle 10 % validation slice and the same metrics are computed on the held-out test slice. For models whose outputs are in log-space, predictions are exponentiated before scoring to keep scale consistent.

4.3 Interpretability

  • Linear models (HAR-RV, OLS): coefficients map cleanly to lagged RV, so drivers are transparent.

  • PCA + LR: noise-reduction helps accuracy, but latent components blur the link to raw features, hurting explainability.

  • Tree ensembles (RF, GB): capture non-linear effects yet produce forests of splits—useful variable rankings but little story. HAR-RV therefore remains the best accuracy/clarity compromise among baselines.

  • GAT goes further: every edge gets an attention weight αᵢⱼ, so we can read how much stock j moves stock i. Averaging α across the test set yields an “influence matrix” (Appendix)that lights up inside each GICS sector while down-weighting weak links. In practice the model says, e.g., tech stocks react mainly to other tech stocks; banks to banks.

Thus, despite its non-linear nature, the GAT provides sector-aware, quantitatively traceable explanations - something neither pure linear nor ensemble models can match.

5. Project Deployment & Interdisciplinary Impact

Deployment process: The Shiny app is deployed on Posit Cloud with GitHub integration for continuous deployment. By linking the Posit Cloud project to a GitHub repository, it monitors the main branch for changes. When developers push commits, Posit Cloud triggers an automatic deployment pipeline, ensuring the live app reflects the latest stable version. This setup manages deployment complexities, allowing teams to focus on development while streamlining the workflow from commit to live application.

ShinyApp Description:

VoltaTrade is a Python Shiny web app for predicting and visualizing stock market volatility, featuring a modern dark-themed interface with glass-morphic design and smooth animations.

The application consists of five main modules (as shown in the figure below): 

  • Overview Dashboard: A central hub displaying an overview of all features with interactive cards for easy navigation. This dashboard provides investors and traders with a quick snapshot of market conditions and app capabilities, helping them make informed decisions at a glance.

  • Model Details: An interactive laboratory comparing volatility prediction models, led by the GAT. It visualizes predicted versus actual value, feature importance, network influences between stocks, and real-time performance metrics. This module empowers users to understand and trust the predictive models, enabling them to select the most reliable tools for their trading strategies.

  • Stock Screener: Filters and ranks stocks based on financial metrics and volatility indicators to identify top performers.Investors and traders can efficiently discover promising stocks that match their risk and return preferences, streamlining the investment selection process.

  • Individual Stock Analysis: Provides in-depth analysis of single stocks with detailed volatility metrics, historical data visualization, and predictive insights.This tab allows users to conduct thorough due diligence on specific stocks, supporting more confident and data-driven investment decisions.

  • Stock Comparison: Enables side-by-side comparison of multiple stocks with visual indicators for volatility, returns, and key financial metrics. Traders and investors can easily compare potential investments, facilitating portfolio diversification and optimal asset allocation.

  • Portfolio Tracker: Tracks portfolio performance with advanced volatility analytics and risk assessment tools. This feature helps users monitor their holdings and manage risk, ensuring their portfolios align with their financial goals and market outlook.

Shiny App Tabs

The app leverages a self-trained GAT predicting model (final model). It includes AI features through Open-AI integration, boasts a sophisticated UI with custom CSS animations, responsive design, and a modern sidebar navigation. The app processes volatility data and offers real-time analysis for financial decision-making.

6. Discussion

6.1 Key findings

  • Superior GAT Performance: The Graph Attention Network (GAT) excelled, achieving the lowest errors (QLIKE: 0.092, MAPE: 0.46, RMSPE: 0.57), reflecting precise point-wise predictions and enhanced volatility calibration compared to traditional models.

  • Inter-Stock Modeling: GAT’s attention-based graph structure captures temporal and cross-stock dependencies, dynamically weighting neighboring stocks’ influence, which improves generalization and predictive accuracy in complex market settings.

  • Interpretability Gains: The attention mechanism highlights influential stocks, such as sector-based peers (e.g., tech stocks impacting other tech stocks), offering transparency in financial forecasting.

  • Baseline Comparison: HAR-RV performed well during volatility shifts (RMSPE ~0.60) due to its hierarchical lag structure, but lacks cross-asset modeling, limiting its scope. PCA-Linear captured latent structures (RMSPE ~0.60) but reduced interpretability. Tree-based models (Random Forest, Gradient Boosting) under-performed (RMSPE >1.50), failing to adapt to regime shifts. 

6.2 Limitations

  • Static Graph Structure: GAT’s graph, built on historical price similarity, remains fixed, unable to adapt to real-time market shifts (e.g., during crises), a constraint in dynamic financial environments.

  • Graph Construction Sensitivity: Performance hinges on neighbor selection (K=3), distance metrics, and price-based similarity, which may miss latent correlations, leading to degraded accuracy and instability across volatility regimes.

  • Limited Feature Space: Excluding macroeconomic indicators (e.g., interest rates) and news sentiment restricts generalizability, as volatility often reflects external shocks not captured by historical RV data.

  • Tick Structure Assumption: Reconstructing prices assumes a 0.01 tick increment, unsuitable for assets with varying tick sizes or liquidity, introducing bias in network science and financial modeling.

  • Computational Cost: GAT’s multi-head attention increases training demands, limiting real-time use in low-latency trading, a challenge in machine learning deployment.

  • Uncertainty Quantification: Lacking confidence intervals, GAT struggles in risk-sensitive financial contexts, reducing its practical utility for decision-making.

6.3 Future work

  • Dynamic Graphs: Implement adaptive graph construction to reflect evolving market conditions, improving inter-stock modeling in finance and network science.

  • Broader Features: Integrate macroeconomic indicators and news sentiment, enhancing data science robustness for external shock scenarios in financial markets.

  • Model Optimization: Apply pruning to reduce GAT’s computational load, enabling real-time applications, addressing machine learning efficiency.

  • Uncertainty Measures: Add predictive uncertainty to support risk-sensitive decisions, bridging finance and network science for practical deployment.

7. Conclusion

This study addresses whether neural networks can leverage inter-stock relationships from market metrics to enhance realized volatility forecasting accuracy compared to traditional models. Our Graph Attention Network (GAT) model confirms this, achieving superior performance (RMSPE=0.57) over models like HAR-RV. Deployed through the VoltaTrade Shiny application, GAT’s attention mechanism elucidates sector-specific influences, enhancing interpretability for financial decision-making. Integrating finance, machine learning, network science, and data science, this work offers a robust, accessible solution. Future enhancements include dynamic graph structures to capture evolving market dynamics, incorporation of macroeconomic indicators, model optimization for real-time use, and uncertainty quantification to support risk-sensitive applications, ensuring scalability and broader applicability.

8. Student Contribution

Shreya Prakash (520496062)

Shreya coordinated the integration of the team’s work, designing a user-friendly Shiny App prototype to showcase project results. She also drafted the project report and prepared impactful presentation slides to communicate findings clearly.

Chenuka Garunsinghe (530080640)

Chenuka recovered time IDs and ensured data stationarity to support accurate analysis. He developed and optimized the Graph Attention Network (GAT) model, enhancing the project’s predictive capabilities.

Enoch Wong (530531430)

Enoch created Figure 1 to visualize key data insights and trained baseline models to establish performance benchmarks. He contributed to the project report and presentation slides, ensuring clear communication of results.

Binh Minh Tran (530414672)

Binh developed the Shiny App, enabling interactive visualization of project outcomes. He also collaborated on the GAT model, improving its design and integration with project goals.

Zoha Kausar (530526838)

Zoha crafted visually engaging presentation slides to effectively convey the team’s findings. She also drafted and refined the project report, ensuring clarity and alignment with objectives.

Ruohai Tao (540222281)

Ruohai trained baseline models and recovered time IDs, ensuring data consistency for analysis. He created visualizations and contributed to the report and results slides, highlighting key project outcomes.

9. References

10. Appendix

10.1 Instructions to run/ environment details

You can render the document or run code chunks directly step by step manually by following the UI in your environment (normally run option is above each code chunk for manual execution). It is suggested to have the yaml code jupyter: python3 in the yaml section in the top of the code file in-order to execute in the Jupyter terminal which is more perfomant and flexible. Make sure to install the packages listed in the file requirements.txt via the command pip install -r requirements.txt in your terminal. It is suggested to use a virtual python environment as well.

10.2 Denormalization and Cleaning (3.2.2) in detail

From our literature review we found that the Kaggle discussion threads revealed that the prices were scaled by an unknown divisor D and then rounded to the nearest real market tick size (~ $0.01). For every (stock_id, time_id) we, forward fill the 600 snapshots so that every second has a quote. Compute first the differences in the price \(\delta P = price_t - price_{t-1}\) and find the smallest non zero absolute jump; that equals \(\frac{1 \text{tick}}{D}\) then multiply the whole bucket by \(\frac{0.01}{\text{min}(|\delta P_{norm}|)}\). We get the real prices by doing \(P^{\text{real}}_t = D \times P^{\text{norm}}_t\). See below Appendix for more detail.

The resulting 3830 x 120 matrix is our master price panel. A quick histogram of \(\delta P\) by tick show exactly integers only which confirms to us the re-scaling recovered genuine tick units.

10.2.1 Handling the gaps and extreme quotes

Similar to earlier we used forward / backward to impute the remaining holes with the last known quotes; this preserves the micros structure dynamics without fabricating new trends and loosing generality of our method. We exclude a stock if more than 0.05 % of its 1-second snapshots are missing on any trading day (≈ 44 of 88 200). This ceiling keeps the expected gap below 1 s in a 10-minute bucket, ensuring forward-fill imputation cannot materially flatten high-frequency dynamics. To prevent single tick glitches we from exploding volatility estimates we Winsorize each stocks price at the 0.1 % and 99.9% of the quantiles.

This now underpins all subsequent features. To finally recover the chronological order of the time_ids to improve the per bucket RV prediction we embedded each bucket in a 1-D spectral manifold and sort by the leading eigen-coordinate. Because prices evolve almost monotonically intra-day, the leading spectral component monotonises the shuffled ids, effectively restoring the hidden chronology. We validate the approach by applying the same embedding to daily closing prices of the S&P-100 (right panel in Figure 1); the recovered order aligns perfectly with calendar dates, confirming the method’s fidelity.

10.2.3 Transforming the data

The goal is to compress the high volatility spikes so the model does not treat them as totally out of sample. Hence, a good candidate we chose was log + first differences where \(\epsilon=10^{-8}\) guards against log(0). After transformation, the series oscillates around zero with roughly constant variance, making chronological splitting much more reliable (the model is no longer “blindsided” by a massive spike) while still being stationary.

Show Code
plot_kdtree_network(edge_index, edge_weight, valid_indices)
<networkx.classes.graph.Graph object at 0x31ca7add0>

10.3 Feature Engineering (3.2) in detail

We need to enrich the transformed_rv_pivot (\(T \times N\)) where T is the number of time buckets and N is the number of stocks with more detail to increase the information gain of the data for the GAT model to learn both temporal pattern and short term fluctuations. In order to achieve this we proceeded with the following features (See appendix for more detail on features):

  1. Own-RV lags: RV often exhibits auto-persistence: a high‐volatility bucket tends to be followed by elevated volatility. So for each stock we included the precious three buckets of transformed RV to capture the short-term persistence of volatility.
  2. Volatility Momentum: Beyond raw persistence, we want to capture changes in the short-term trend—for example, if volatility is accelerating or decelerating, this was done by calculating difference between the average RV over most recent and preceding three buckets
  3. Mean reversion tendency: empirically, volatility often reverts toward a longer‐term mean after extreme moves. We calculate the negative deviation of the current RV from its ten‐bucket rolling average, encoding how strongly each stock’s volatility is “pulled back” toward a longer‐term mean.
  4. Volatility of volatility: Some stocks exhibit wild swings in volatility itself (for instance, jumps around earnings). We take the rolling standard deviation of the last five buckets of RV for each stock, quantifying how erratic or “jittery” the volatility itself has been over that short window

10.4 Correlation Heatmap

Show Code
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from torch_geometric.data import Data

def extract_and_plot_attention(model, X_test, edge_index, edge_weight, valid_indices, sample_size=20):
    """
    Extract attention weights and create heatmap
    
    Parameters:
    - model: trained GAT model
    - X_test: test features [T, N, F]
    - edge_index: edge connections [2, E]
    - edge_weight: edge weights [E]
    - valid_indices: valid stock indices
    - sample_size: number of time steps to sample
    """
    model.eval()
    
    T, N, F = X_test.shape
    attention_matrix = np.zeros((N, N))
    count_matrix = np.zeros((N, N))
    
    with torch.no_grad():
        for t in range(min(sample_size, T)):
            # Create graph for this time step - move to same device as model
            graph = Data(
                x=torch.tensor(X_test[t], dtype=torch.float).to(next(model.parameters()).device),
                edge_index=edge_index.to(next(model.parameters()).device),
                edge_weight=edge_weight.to(next(model.parameters()).device)
            )
            
            try:
                # Forward pass with attention extraction
                # This modifies the first layer to return attention
                x = graph.x
                h, (edge_idx, attention_weights) = model.conv1(x, graph.edge_index, return_attention_weights=True)
                
                # Convert to numpy
                edge_idx = edge_idx.cpu().numpy()
                attention_weights = attention_weights.cpu().numpy()
                
                # Handle multi-head attention (average across heads)
                if len(attention_weights.shape) > 1:
                    attention_weights = attention_weights.mean(axis=-1)
                
                # Fill attention matrix
                for i, (src, tgt) in enumerate(edge_idx.T):
                    if src < N and tgt < N:
                        attention_matrix[src, tgt] += attention_weights[i]
                        count_matrix[src, tgt] += 1
                        
            except Exception as e:
                print(f"Error at time step {t}: {e}")
                continue
    
    # Average attention weights
    mask = count_matrix > 0
    attention_matrix[mask] = attention_matrix[mask] / count_matrix[mask]
    
    # Create heatmap
    plt.figure(figsize=(12, 10))
    
    # Use subset for cleaner visualization if too many stocks
    if N > 30:
        subset_size = 30
        subset_idx = np.random.choice(N, subset_size, replace=False)
        subset_matrix = attention_matrix[np.ix_(subset_idx, subset_idx)]
        subset_labels = [f'S{valid_indices[i]}' for i in subset_idx]
        title = f"GAT Attention Matrix (Random {subset_size} stocks)"
    else:
        subset_matrix = attention_matrix
        subset_labels = [f'S{valid_indices[i]}' for i in range(N)]
        title = "GAT Attention Matrix (All stocks)"
    
    # Plot heatmap
    sns.heatmap(subset_matrix, 
                xticklabels=subset_labels, 
                yticklabels=subset_labels,
                cmap='Blues', 
                square=True,
                cbar_kws={'label': 'Attention Weight'},
                linewidths=0.1)
    
    plt.title(title, fontsize=14, fontweight='bold')
    plt.xlabel('Target Stock (receives attention)')
    plt.ylabel('Source Stock (gives attention)')
    plt.xticks(rotation=45)
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()
    
    # Print some statistics
    non_zero = np.count_nonzero(attention_matrix)
    max_attention = np.max(attention_matrix)
    mean_attention = np.mean(attention_matrix[attention_matrix > 0])
    
    print(f"\nAttention Statistics:")
    print(f"Non-zero attention weights: {non_zero}")
    print(f"Max attention weight: {max_attention:.4f}")
    print(f"Mean attention weight: {mean_attention:.4f}")
    
    return attention_matrix
Show Code
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from torch_geometric.data import Data

def load_and_visualize_attention(edge_index, edge_weight, valid_indices, neighbor_indices, rv_pivot, device):
    """    
    Parameters:
    - edge_index: your pre-computed edge connections
    - edge_weight: your pre-computed edge weights
    - valid_indices: your valid stock indices
    - neighbor_indices: your neighbor matrix
    - rv_pivot: your realized volatility pivot table
    - device: your torch device
    """
    checkpoint = torch.load('best_gat_model.pt', map_location=device)
    
    model = ImprovedVolatilityGAT(
        in_feats=7,  # 6 temporal + 1 neighbor feature
        hidden=32,  
        heads=2,
        dropout=0.2
    ).to(device)
    
    # Load weights
    model.load_state_dict(checkpoint['model_state_dict'])
    print(f"Model loaded with validation loss: {checkpoint['val_loss']:.6f}")
    print(f"Model device: {next(model.parameters()).device}")
    
    # Prepare test data (use the same approach as in your evaluation)
    total_time_steps = len(rv_pivot)
    test_start = int(total_time_steps * 0.8)
    test_rv_pivot = rv_pivot.iloc[test_start:]
    
    # Create test features (same as your training pipeline)
    test_transformed_rv = create_stationary_features_fixed(test_rv_pivot)
    X_test_temporal = build_initial_features(test_transformed_rv)
    X_test_neighbor, _ = build_neighbour_feats(test_transformed_rv, neighbor_indices)
    X_test = np.concatenate([X_test_temporal, X_test_neighbor], axis=2)
    
    print(f"Test data shape: {X_test.shape}")
    
    # Make sure edge tensors are on the correct device
    edge_index = edge_index.to(device)
    edge_weight = edge_weight.to(device)
    print(f"Edge tensors moved to device: {edge_index.device}")
    
    # Extract and visualize attention
    attention_matrix = extract_and_plot_attention(
        model, X_test, edge_index, edge_weight, valid_indices, sample_size=20
    )
    
    return model, attention_matrix

model, attention_matrix = load_and_visualize_attention(edge_index, edge_weight, valid_indices, neighbor_indices, rv_pivot, device)
Model loaded with validation loss: 17.936584
Model device: cpu
Input matrix: 766 time points × 112 stocks
Initial features: (765, 112, 6) (T × N × 6 features)
Features created: (765, 112, 1) (T × N × 1 features)
Test data shape: (765, 112, 7)
Edge tensors moved to device: cpu

Attention Statistics:
Non-zero attention weights: 448
Max attention weight: 1.0000
Mean attention weight: 0.2500

Figure 1: Heatmap of inter-stock correlations

References

Andersen, T.G. et al. (2003) “Modeling and forecasting realized volatility,” Econometrica, 71(2), pp. 579–625.
Cont, R. (2001) “Empirical properties of asset returns: Stylized facts and statistical issues,” Quantitative Finance, 1(2), pp. 223–236.
Corsi, F. (2009) “A simple approximate long-memory model of realized volatility,” Journal of Financial Econometrics, 7(2), pp. 174–196.
Governors of the Federal Reserve System, B. of (2022) Changes in u.s. Family finances from 2019 to 2022: Evidence from the survey of consumer finances. Available at: https://www.federalreserve.gov/publications/files/scf22.pdf.
Hull, J. (2015) Options, futures, and other derivatives. 9th ed. Pearson.
Optiver (2021) “Optiver realized volatility prediction challenge.” https://www.kaggle.com/competitions/optiver-realized-volatility-prediction.
Velickovic, P. et al. (2018) “Graph attention networks,” in International conference on learning representations (ICLR).
Yin, H. (2023) “Enhancing directional accuracy in stock closing price value prediction using a direction-integrated MSE loss function,” International Journal of Computational Science and Engineering, 28(1), pp. 1–12. Available at: https://www.scitepress.org/Papers/2023/128102/128102.pdf.
Zhang, Y. et al. (2022) “Stock price movement prediction with graph attention networks,” Expert Systems with Applications, 191, p. 116367.